An Introduction Using R and rvest.
“[W]eb scraping is the practice of gathering data through any means other than a program interacting with an API (or, obviously, through a human using a web browser). This is most commonly accomplished by writing an automated program that queries a web server, requests data (usually in the form of HTML and other files that compose web pages), and then parses that data to extract needed information.”
— Ryan Mitchell, Web Scraping with Python
Once we’ve identified the URL of the page we’d like to scrape, the process of extracting data follows three general steps:
Here’s a simple example adapted from the rvest documentation:
library(rvest)
# Fetch the HTML source code for the IMDb page for The Lego Movie
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# Extract the average user rating
rating <- lego_movie %>% # Start with the html source
html_nodes("strong span") %>% # Use a CSS selector to find the right section
html_text() %>% # Extract text from the selected section
as.numeric() # Format the selected text as numeric
rating
[1] 7.8
Cheese.com is a fantastic website with a database of 1,829 varieties of cheese. They’re also active on Twitter!
Cheese 🧀of the day: RICOTTA DI BUFALA🧀#RICOTTADIBUFALA🧀 #PasteurizedCheese #WaterBuffaloMilk🐃 #ItalianCheese #FreshFirmCheese #CreamyTexture https://t.co/V6ur79q4xL pic.twitter.com/zPayyscPOb
— Cheesedotcom (@Cheesedotcom) December 20, 2019
In the rest of the presentation, I’ll step through how to scrape their website to extract some important cheese data for analysis.1
We would eventually like to scrape data on all of the cheeses in the cheese.com database. However, it’s often helpful to start from the specific and then generalize from there.
Let’s start by extracting data from the page for brunost, a very tasty Norwegian cheese made from cow’s and goat’s milk.2 It appears that the page for any particular cheese can be found by navigating to https://www.cheese.com/*cheese name*/, so we’ll read the HTML for https://www.cheese.com/brunost/:
brunost_url <- "https://www.cheese.com/brunost"
(brunost_html <- read_html(brunost_url))
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; cha ...
[2] <body>\n <!-- Top Banner -->\n <div id="top-banner">\n ...
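Since every cheese page seems to follow the same URL pattern, building the URL can be wrapped in a small function. The sketch below is only an illustration: `cheese_url()` is a hypothetical helper, and the rule that multi-word names are lower-cased and joined with hyphens is an assumption that would need to be checked against the site.

```r
# Hypothetical helper: build the URL of a cheese page from the cheese's name.
# Assumes the pattern https://www.cheese.com/<cheese name>/ and, as a guess,
# that names are lower case with hyphens in place of spaces.
cheese_url <- function(cheese_name) {
  slug <- gsub("\\s+", "-", tolower(trimws(cheese_name)))
  paste0("https://www.cheese.com/", slug, "/")
}

cheese_url("brunost")
# [1] "https://www.cheese.com/brunost/"
```

The result can be passed straight to read_html(), which is how we would generalize from one cheese to many later on.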
HTML elements have properties (e.g., type, class, id, attribute). For example, the list of cheese facts that appears alongside the photo is contained in an unordered list (ul) element with the class "summary-points". The first list item (li) contains a paragraph (p) element that tells what kind of milk the cheese is made from.
<ul class="summary-points">
<li><i class="fa fa-flask" aria-hidden="true"></i>
<p>Made from pasteurized or unpasteurized <a href="/by_milk/?m=cow">cow</a>'s
and <a href="/by_milk/?m=goat">goat</a>'s milk</p>
</li>
...
</ul>
We can use this structure to create rules about what to extract from the page source.
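To make the idea of element properties concrete, here is a minimal sketch that parses a copy of the fragment above from a string (so it runs without a network connection) and inspects the type, class, and attributes of a couple of elements. The object names `snippet`, `summary_list`, and `icon` are made up for this illustration.

```r
library(rvest)

# A copy of the summary list shown above, parsed from a string
snippet <- read_html('<ul class="summary-points">
  <li><i class="fa fa-flask" aria-hidden="true"></i>
    <p>Made from pasteurized or unpasteurized <a href="/by_milk/?m=cow">cow</a>\'s
    and <a href="/by_milk/?m=goat">goat</a>\'s milk</p>
  </li>
</ul>')

summary_list <- html_node(snippet, "ul")
html_name(summary_list)           # the element type: "ul"
html_attr(summary_list, "class")  # the class attribute: "summary-points"

icon <- html_node(snippet, "li i")
html_attrs(icon)                  # all attributes of the <i> element
```

The icon element carries two classes (fa and fa-flask) in a single class attribute, separated by a space.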
CSS selectors are patterns used to select HTML elements. I won’t cover the entire variety of CSS selectors, but luckily you can learn (almost) everything you need to know about them by playing the CSS Diner game or by consulting a CSS selector reference page.
Exercise: Complete the first ten levels of CSS Diner.
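As a quick warm-up, the sketch below tries a few common selector types against a tiny, made-up HTML fragment (the fragment and the name `page` are invented for this example and have nothing to do with cheese.com's real markup).

```r
library(rvest)

# A small invented document to practise selectors on
page <- read_html('<div id="menu">
  <p class="cheese soft">Brie</p>
  <p class="cheese hard">Parmesan</p>
  <span class="cheese">Feta</span>
</div>')

page %>% html_nodes("p") %>% html_text()           # by type:  "Brie" "Parmesan"
page %>% html_nodes(".cheese") %>% html_text()     # by class: "Brie" "Parmesan" "Feta"
page %>% html_nodes("p.hard") %>% html_text()      # type and class: "Parmesan"
page %>% html_nodes("#menu span") %>% html_text()  # span inside id "menu": "Feta"
```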
Let’s try to select the entire unordered list element with the class summary-points (there’s only one). The CSS selector ul.summary-points should do this for us.
brunost_html %>% # start w/ the html we read in earlier
html_nodes(css = "ul.summary-points") %>% # extract element w/ CSS selector
html_text() # extract text from the element
[1] "Made from pasteurized or unpasteurized cow's and goat's milk\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tCountry of origin: Denmark, Finland, Germany, Iceland, Norway and Sweden\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tType: semi-soft, whey\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\tFat content: 27 g/100g\n\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tCalcium content: 360 mg/100g\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tTexture: dense\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tRind: natural\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tColour: brown\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tFlavour: caramel, sweet\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tProducers: 
Tine\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\tSynonyms: mysost, mesost, meesjuusto, mysuostur, myseost, Braunkäse, geitost, Ekte Geitost, Gudbrandsdalsost\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t"
This gets us what we want (the types of milk used to make the cheese) but much else besides. We need to refine our selection to separate the curds from the whey, so to speak.3
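One possible intermediate refinement, sketched here on an abbreviated copy of the list rather than the live page, is to select each list item separately and trim the surrounding whitespace. The fragment below keeps only three of the facts from the output above, and `summary_html` and `facts` are names made up for this example.

```r
library(rvest)

# Abbreviated stand-in for the summary list on the brunost page
summary_html <- read_html('<ul class="summary-points">
  <li>
      <p>Type: semi-soft, whey</p>
  </li>
  <li>
      <p>Texture: dense</p>
  </li>
  <li>
      <p>Rind: natural</p>
  </li>
</ul>')

# One string per list item, with leading and trailing whitespace removed
facts <- summary_html %>%
  html_nodes(css = "ul.summary-points li") %>%
  html_text(trim = TRUE)

facts
# [1] "Type: semi-soft, whey" "Texture: dense"        "Rind: natural"
```

This gives one clean string per fact, though it still does not isolate the milk types on their own.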
We can see that the words we want to extract (cow and goat) are contained in hyperlink elements (a) inside the paragraph element (p) that immediately follows the icon element (i) with two classes, fa and fa-flask. The second class seems to be the more exclusive of the two, so we’ll try to select these hyperlink elements using .fa-flask + p a. We can read this as "select every a element inside a p element that immediately follows an element with class fa-flask".
brunost_html %>%
html_nodes(css = ".fa-flask + p a") %>%
html_text()
[1] "cow" "goat"
It worked! If it seems difficult to come up with the right CSS selector, not to worry: there’s an excellent tool called SelectorGadget (https://selectorgadget.com/) that will create CSS selectors for you.
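To see why the adjacent-sibling combinator (+) matters in .fa-flask + p a, the sketch below compares it with a broader selector on a small fragment. The first list item mirrors the markup shown earlier in simplified form; the second list item (the fa-flag class and the /by_country/ link) is invented for this illustration and is not taken from the real page.

```r
library(rvest)

snippet <- read_html('<ul class="summary-points">
  <li><i class="fa fa-flask" aria-hidden="true"></i>
    <p>Made from <a href="/by_milk/?m=cow">cow</a> and <a href="/by_milk/?m=goat">goat</a> milk</p>
  </li>
  <li><i class="fa fa-flag" aria-hidden="true"></i>
    <p>Country of origin: <a href="/by_country/?c=NO">Norway</a></p>
  </li>
</ul>')

# Every link anywhere in the list: too broad
all_links <- snippet %>% html_nodes(".summary-points a") %>% html_text()
all_links
# [1] "cow"    "goat"   "Norway"

# Only links in the p that directly follows the fa-flask icon
milk_links <- snippet %>% html_nodes(".fa-flask + p a")
html_text(milk_links)
# [1] "cow"  "goat"

# The same nodes also give access to attributes such as the link target
html_attr(milk_links, "href")
# [1] "/by_milk/?m=cow"  "/by_milk/?m=goat"
```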